N96-11209 


(NASA-CR-199282) AN IMPLEMENTATION 
AND PERFORMANCE MEASUREMENT OF THE 
PROGRESSIVE RETRY TECHNIQUE (Bell 
Telephone Labs.) 8 p 


Unclas 


G3/61 0065036 



NASA-CR-199282




An Implementation and Performance Measurement of the 

Progressive Retry Technique 




Gaurav Suri*, Yennun Huang*, Yi-Min Wang*, W. Kent Fuchs†, Chandra Kintala*


*AT&T Bell Laboratories
600 Mountain Avenue
Murray Hill, NJ 07974


†Coordinated Science Laboratory
University of Illinois
Urbana, IL 61801


Abstract 

This paper describes a recovery technique called
progressive retry for bypassing software faults in
message-passing applications. The technique is implemented
as reusable modules to provide application-level software
fault tolerance. The paper describes the implementation of
the technique and presents results from applying
progressive retry to two telecommunications systems. The
results show that the technique helps reduce the total
recovery time for message-passing applications.

1 Introduction 

For computer systems designed to provide continuous
services to customers, availability is an important
performance measure. In such systems, software failures
have been observed to be the current major cause of
service unavailability [1, 2]. Residual software faults
due to untested boundary conditions, unanticipated
exceptions and unexpected execution environments have been
observed to escape the testing and debugging process and,
when triggered during program execution, cause service
interruption [3]. It is therefore desirable to have
effective on-line retry mechanisms for automatically
bypassing software faults and recovering from software
failures in order to achieve high availability [4, 5, 6, 7].

Several studies [2, 8, 9] have shown that many soft- 
ware failures in production systems behave in a tran- 
sient fashion, and so the simplest way to recover from 
such failures is to restart the system, an approach that 
we call environment diversity. The term Heisenbug [1]
has been used to refer to the software faults causing 
transient failures, while the term Bohrbug refers to 
software faults which have deterministic behavior. 

The watchd daemon and the libft library have been used
in several AT&T products to tolerate Heisenbugs [10].
Watchd is a daemon process which monitors system failures
such as machine crash, process death and process hang. If
a machine crashes, all critical applications running on
the crashed machine are migrated to another machine. If a
process dies (due to bugs in the program), watchd first
restarts the process locally. If the restarted process
fails again, the process is then migrated to another
machine. Libft provides functions for message logging,
critical-data checkpointing, fault-tolerant inter-process
communication and name services. With libft, an
application process can checkpoint its critical data on
the local machine as well as on backup machines.
Therefore, when a process is restarted, it can restore its
checkpointed state and replay its message log to
reconstruct its pre-failure state. Watchd keeps track of
the dependence between processes. In the event of a
failure, the failed process as well as all other processes
that depend on it are rolled back in order to guarantee
state consistency. Watchd and libft together provide a
simple, portable and reusable component for an application
to tolerate Heisenbugs.
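
The restart-then-migrate policy described above can be
sketched as a small decision function. This is only an
illustration of the policy as stated in the text; the
function name, the failure labels and the return values
are hypothetical and are not part of watchd.

```python
def watchd_action(failure, restart_attempts=0):
    """Pick a recovery action for a monitored process.

    failure: "machine_crash" or "process_death" (hypothetical labels).
    restart_attempts: number of local restarts already tried.
    """
    if failure == "machine_crash":
        # Critical applications on a crashed machine are migrated.
        return "migrate"
    if failure == "process_death":
        # First try a local restart; if that already failed, migrate.
        return "restart_locally" if restart_attempts == 0 else "migrate"
    raise ValueError(f"unhandled failure type: {failure}")
```

The text does not say which action is taken for a process
hang, so the sketch deliberately leaves that case
unhandled.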

Our experience has shown that many errors can 
be successfully tolerated by using the simple rollback- 
and-retry mechanism provided by watchd and libft. 
The simple mechanism, although effective, presents 
some problems. First, the recovery time can be long 
so that the service disruption due to the recovery can 
be unbearable. Any process failure results in a global 
restart. In an application consisting of many pro- 
cesses, a global restart can take a long time before the 
application returns to normal execution. Therefore, 
it is desirable to limit the scope of rollback by keep- 
ing track of the dynamic inter-process communication 
patterns and rolling back only the processes which di- 
rectly communicate with the failed processes in the 
current checkpoint interval. Second, the simple retry 
with a deterministic replay of message logs usually re- 
constructs the application state to the same state as 
existed before failure. If the state is erroneous and the 
application behavior is deterministic, the retry and re- 
play will not help. However, if message dependency is 
recorded, a failed application can replay the messages 
in a different but consistent order so that the
application reaches a new but correct state after retry. In
other words, by replaying the message log in a differ- 
ent but consistent order, more software failures may 
be bypassed. 

The above two observations motivate the exten- 
sion of the rollback-and-retry mechanism provided 
by watchd and libft. This paper describes a
progressive retry technique for software failure recovery
in message-passing applications.¹ The target
applications are continuously-running software systems
for which fast recovery is essential and a reasonable 
amount of run-time overhead may not result in no- 
ticeable service quality degradation. Many telecom- 
munications systems fall into this category. There are 
several reasons that fast recovery is desirable in appli- 
cations requiring high availability. In the cases where 
service quality is judged at the user interface level, 
small “computer down time” involving only a small 
number of processes may be translated into zero “ser- 
vice down time.” Most importantly, when the pro- 
longed unavailability of one part of the system may 
trigger the boundary conditions in other parts of the 
system, localized and fast recovery can reduce the pos- 
sibility of cascading failures which may lead to a catas- 
trophe. 

The progressive retry technique is based on
checkpointing, rollback, message replaying and message
reordering. The goal is to limit the scope of rollback:
the number of involved processes as well as the total
rollback distance. The approach consists of several retry
steps and gradually increases the scope of rollback when a
previous retry fails. The technique is implemented in the
watchd daemon and the libft library.

2 Progressive Retry 

The simple example in Fig. 1 is used to illustrate
the basic concept of progressive retry. The reader is
referred to [12] for a detailed discussion. For the
purpose of presentation, we assume every message is logged
before it is processed, and is therefore available at the
time of recovery. Suppose p2 detects an error at the
point marked “X” in Figure 1 and initiates the
progressive retry. In the Step-1 receiver replaying retry,
p2 rolls back and replays messages Ma and Mb in exactly
the same order as they were processed before the
rollback. If the detected error was caused by some
transient environmental problem (such as mutual exclusion
conflicts, resource unavailability, unexpected signals,
etc.), then Step-1 retry may succeed and p2 can proceed.
Under the deterministic assumption, the exact same copy of
Mo will be generated during the recovery. Therefore, p2
does not need to resend Mo and the receiver of Mo,
process p4, does not have to be involved in the retry.

If Step-1 retry fails, then p2 rolls back again and
executes Step-2 receiver reordering retry by reordering
Ma and Mb in its message log. If the original error
was triggered by a boundary condition, then message
reordering may be useful for bypassing that condition
and thereby recovering from the error. Note that since
message reordering forces a different execution path
for p2, we cannot expect that the same message Mo
will still be generated. Such a message is called an
orphan message [11] and should be discarded. As a
result, p4 should also be rolled back in order to undo
the effect of Mo. The message Mi, however, is not an
orphan message because its sender is not rolled back.
Such a “sent but not yet received message” is called an
in-transit message [12]. It needs to be processed again
by the restarted receiver but its associated processing
order information can be discarded.

¹ We will focus on error recovery in this paper; the issue of
error detection is considered elsewhere [10].


Figure 1: Example for illustrating the basic concept
of progressive retry.

There are several potentially useful algorithms for 
reordering the message logs. Random reordering can 
be used when no knowledge about the possible cause 
of the software failure is available. If the failure is 
possibly due to the interleaving of messages from dif- 
ferent processes, reordering by grouping the messages 
from the same process together may be useful. If the 
software fault might have been triggered by exhaust- 
ing all available resources, reordering the messages so 
that every resource is freed at the earliest possible mo- 
ment can often bypass the boundary condition. 
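
Two of the reordering strategies above (random reordering
and grouping by sender) can be sketched as follows. This
is a minimal illustration, not the libft implementation:
the log is modeled as a list of (sender, sequence number)
pairs, and all function names are hypothetical. Both
functions keep messages from the same sender in their
original relative order, which Section 3.4 states as a
requirement for reordering.

```python
import random
from collections import defaultdict, deque

def group_by_sender(log):
    """Replay all messages of one sender together, senders in first-arrival order."""
    groups = defaultdict(list)
    for sender, seq in log:
        groups[sender].append((sender, seq))
    # dict insertion order gives the order in which senders first appeared
    return [msg for msgs in groups.values() for msg in msgs]

def random_reorder(log, rng):
    """Random interleaving that keeps each sender's messages in FIFO order."""
    queues = defaultdict(deque)
    for sender, seq in log:
        queues[sender].append((sender, seq))
    reordered = []
    while queues:
        sender = rng.choice(sorted(queues))   # pick any sender with pending messages
        reordered.append(queues[sender].popleft())
        if not queues[sender]:
            del queues[sender]
    return reordered

def fifo_preserved(log):
    """True if sequence numbers never decrease for any single sender."""
    last_seen = {}
    for sender, seq in log:
        if seq < last_seen.get(sender, -1):
            return False
        last_seen[sender] = seq
    return True
```

Passing a seeded random.Random instance as rng makes a
reordering reproducible, which can be convenient when a
retry has to be repeated or debugged.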

If Step-2 retry fails, then Step-3 sender replaying
retry will involve in the recovery process all the
processes that have sent messages to p2. In Fig. 1, p1
(p3) rolls back and replays Mw and Mx (Mz and My)
in their original order.² Step-3 retry basically gives
the messages a second chance to interleave “naturally”
with each other, and can be useful for error recovery
if the original error was due to some rare message
racing conditions. Under the deterministic assumption,
the exact same copy of the message sent to p0 will be
generated and so p0 does not need to be involved in the
rollback. In contrast, p4 needs to roll back because of
the orphan message Mo, and Mi remains an in-transit
message.

If Step-3 retry still fails, it is suspected that an
undetected error might have occurred at p1 or p3 and
was propagated to p2 through the erroneous messages
Ma or Mb to cause the detected error. Step-4 sender
reordering retry is designed to bypass the software bug
that caused the undetected error. In this step, process
p1 rolls back and reorders Mw and Mx, and p3 rolls
back and reorders Mz and My. As a result, messages
Ma, Mb, Mi, Mo and the message sent to p0 all become
orphan messages and all the five processes in Fig. 1
need to participate in the recovery.

² Messages Mw, Mx, My and Mz are from processes other
than the five processes shown in the Figure.

When all previous small-scope retries fail, the ob- 
jective of localized recovery can no longer be achieved 
and a large-scope rollback needs to be initiated. All
the processes in the system, including the senders of
Mw, Mx, My and Mz, are rolled back to the latest
globally consistent set of checkpoints obtained through 
coordinated checkpointing [13] or lazy coordination 
[14]. If a system has been functioning correctly for 
most of the time and failures are rare events, rolling 
back the entire system can often recover from the fail- 
ures. The potential disadvantages of a large-scope roll- 
back include unnecessarily involving healthy critical 
processes in the rollback and a longer recovery time. 
In a later section, we show that, for certain systems, 
the cost of the first four steps of progressive retry is 
small compared to the cost of a large-scope rollback. 
For such systems, the five-step progressive retry de- 
scribed in this section is an attractive technique for 
providing low-cost and efficient software failure recov- 
ery. 
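
The escalation through the five steps can be summarized
by a short driver loop. This is a sketch under the
assumption that each step can be tried through a single
callback that reports success or failure; the names
STEPS, progressive_retry and attempt are hypothetical.

```python
STEPS = [
    (1, "receiver replaying"),
    (2, "receiver reordering"),
    (3, "sender replaying"),
    (4, "sender reordering"),
    (5, "large-scope rollback"),
]

def progressive_retry(attempt):
    """Try each retry step in order of increasing rollback scope.

    attempt(step_number) must return True if the application recovered.
    Returns (step_number, step_name) for the first successful step,
    or None if even the large-scope rollback did not help.
    """
    for number, name in STEPS:
        if attempt(number):
            return number, name
    return None
```

For example, a failure that needs the senders to replay
their messages makes the loop stop at step 3 without ever
attempting the more expensive steps 4 and 5.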

3 Implementation 

The progressive retry mechanism is implemented in 
the libft library and the watchd daemon [10]. The
heart of the system is a centralized message server 
which dynamically keeps track of all the information 
required for progressive retry. The message server runs 
as a child of the watchd daemon and uses the check- 
pointing capabilities of the libft library in order to 
make its own operation fault-tolerant. The other im- 
portant components of the progressive retry mecha- 
nism are watchd, the throwback agents and the recov- 
ery management functions (see Figure 2). 

Watchd monitors user processes for failures. As 
soon as it detects a failure, it restarts the process. It 
also communicates with the message server to get
information about the other processes that need to be
restarted. It then kills and restarts those processes. 
The first action taken by each of the restarted pro- 
cesses is to communicate with the message server and 
find out the recovery actions that need to be taken. 
Each process then sets up its recovery status accord- 
ingly and proceeds with recovery. The following sub- 
sections explain each of these functions in detail. 

3.1 Message Server 

As described earlier, the message server is the most
important component of the progressive retry mechanism.
It has the following functions:

• Keep track of the communication graph during
failure-free operation,

• Maintain status information for each process in
the system, and

• Compute the recovery line during failure recovery.


Figure 2: Progressive retry system architecture
(RMF - Recovery Management Functions)


3.1.1 Dynamic Communication Graph 

The message server needs to keep track of the message 
dependencies during normal program execution in or- 
der to be able to compute the recovery line during the 
recovery process. The graph is computed on the basis 
of the pattern sent to the message server by the re- 
ceivers of messages in the system. Each process main- 
tains a local communication graph in which it keeps 
track of all the processes that have sent messages to 
it so far. Whenever it gets a message from a process 
that is not present in its local graph, it adds the
process to the local graph and also sends the information
to the message server so that the global communica- 
tion graph can be updated. 
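
The update rule above keeps traffic to the message
server low: a receiver reports a sender only the first
time it hears from it. A minimal sketch follows, with
hypothetical class and method names and with the
inter-process communication replaced by a direct method
call.

```python
class MessageServer:
    """Holds the global communication graph as receiver -> set of senders."""

    def __init__(self):
        self.senders_of = {}

    def report_edge(self, sender, receiver):
        self.senders_of.setdefault(receiver, set()).add(sender)


class Process:
    """Keeps a local graph and reports only previously unseen senders."""

    def __init__(self, name, server):
        self.name = name
        self.server = server
        self.local_senders = set()
        self.reports_sent = 0

    def on_receive(self, sender):
        if sender not in self.local_senders:
            self.local_senders.add(sender)
            self.server.report_edge(sender, self.name)
            self.reports_sent += 1
```

With this rule a process that receives thousands of
messages from three peers sends only three reports to the
message server.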


3.1.2 Process Status Information 

When the application is recovering from the failure of 
a process under progressive retry, the status of other 
processes is also affected to some degree depending on 
the communication pattern and the stage of retry the 
system is in. The processes need to be assigned differ- 
ent status values so that they know whether they have 
to deterministically replay the pre-failure receiver log,
reorder and replay the log, receive in-transit messages 
from the communication channel or perform as normal 
processes. The reasons for making these distinctions 
are explained in Section 3.4. 






3.1.3 Recovery Line Computation 

Recovery line computation is the first step to be car- 
ried out when a failed process restarts and progressive 
retry needs to be initiated. It involves carrying out
the following functions: 

• Calculate the retry step number for the system 

• Analyze the communication graph, and 

• Determine the status of each affected process 
based on the first two steps 

The recovery line computation may result in a 
change of status for some of the processes in the
application. Since all processes run as children of watchd,
this information is conveyed to watchd so that it can 
take appropriate action and restart processes that 
need a status change. 
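
As an illustration of how the retry step number and the
communication graph together determine which processes
are affected, the sketch below reproduces the walk-through
of Section 2. It is a simplification and not the
algorithm of [12]: it looks only one hop away from the
failed process, and the edges used in the example
(including which processes send the message to p0 and the
in-transit message to p4) are assumptions made for
illustration.

```python
def rollback_set(step, failed, sends_to, all_processes):
    """Return the processes that roll back at a given retry step.

    sends_to maps each process to the set of processes it has sent
    messages to in the current checkpoint interval.
    """
    def receivers_of(procs):
        return {r for p in procs for r in sends_to.get(p, ())}

    senders = {s for s, targets in sends_to.items() if failed in targets}
    if step == 1:
        # Deterministic replay regenerates identical messages: nobody else is involved.
        return {failed}
    if step == 2:
        # Reordering at the failed process orphans the messages it sent.
        return {failed} | receivers_of({failed})
    if step == 3:
        # Senders replay deterministically, so their other receivers stay untouched.
        return {failed} | receivers_of({failed}) | senders
    if step == 4:
        # Senders reorder too, so everything they sent becomes an orphan.
        reordered = {failed} | senders
        return reordered | receivers_of(reordered)
    return set(all_processes)  # step 5: large-scope rollback


# Assumed edges for the five processes of Fig. 1.
EXAMPLE_EDGES = {"p1": {"p2", "p0"}, "p3": {"p2", "p4"}, "p2": {"p4"}}
```

With EXAMPLE_EDGES and p2 as the failed process, the set
grows from {p2} at step 1 to {p2, p4} at step 2, to four
processes at step 3, and to all five at step 4, matching
the description in Section 2.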

3.2 Watchd 

The basic functions of watchd [10] are to period- 
ically monitor processes to see if they are alive, and 
to restart failed processes. Under progressive retry,
watchd is given the following additional responsibili- 
ties: 

• Invoke the message server to initiate progressive 
retry when a failed process is brought up again. 

• Kill and restart all the processes that require a 
status change during retry. 

3.3 Throwback Agents 

The throwback agents are invoked on a one-per-process
basis and their function is to simulate the presence of
in-transit messages. If a process is assigned a status
which implies that there are pending in-transit messages,
these messages need to be resent to it during recovery.
The process fires a throwback agent which analyzes the log
of the process and sends back to the process all messages
that have become in-transit according to the new recovery
line. Once all in-transit messages have been re-sent, the
throwback agent terminates, indicating the completion of
recovery at that node.
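
The log analysis performed by a throwback agent can be
sketched as a split of the receiver log into in-transit
and orphan messages. This is a simplified model for a
process that rolls back only because it received an
orphan message (such as p4 in Section 2): a logged
message is treated as in-transit if its sender is not
rolled back, and as an orphan otherwise. The function
name and the (sender, payload) log layout are
hypothetical.

```python
def split_log(receiver_log, rolled_back):
    """Split a receiver log according to the new recovery line.

    receiver_log: list of (sender, payload) in the order received.
    rolled_back:  set of processes that roll back in this retry step.
    Returns (in_transit, orphans); both keep the original log order,
    so FIFO order per sender is preserved when messages are re-sent.
    """
    in_transit = [m for m in receiver_log if m[0] not in rolled_back]
    orphans = [m for m in receiver_log if m[0] in rolled_back]
    return in_transit, orphans
```

In the Step-2 example, assuming Mi was sent by p3, the
log of p4 splits into the in-transit message Mi, which is
sent back to p4, and the orphan Mo, which is discarded.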

3.4 Recovery Management Functions 

The recovery management functions are a part of
libft and are responsible for recovery initialization,
setup and management for each process in a local manner.
The functions in libft that do recovery management are
checkpoint(), recover(), recovered(), setlogfile(),
ftrecsetup(), ftread() and ftwrite().

ftrecsetup() is the function that sets up the
recovery for each process. It communicates with the
message server to get the recovery line information. It 
then uses that information to set its local status value 
and fire a throwback agent, if required. 

The function recovered() returns a value which
indicates which stage of recovery the system is in. The
stages can be: doing deterministic replay, receiving
in-transit messages, or recovery completed. The return
value is used by the process to determine whether
it should be receiving messages from the communication
channel or retrieving them from the log files.
These values, in conjunction with the status values, are
also used by ftread() and ftwrite(). The function
ftread() examines the status value of the process,
and based on that value reads the next message either
from the log or from the communication channel.
Even when reading from the channel, a distinction is
made between receiving in-transit messages and new
messages. In order to maintain consistency, all in-transit
messages from a sender must be received before any
new messages can be received from that sender, and
FIFO order must be maintained for the in-transit messages.
The received messages are logged before they can be
processed. The status value determines whether they are
logged in a temporary log file or in the regular log file.

The function ftwrite() sends messages after logging
them. The status values indicate whether message
comparison needs to be done (in order to verify the
deterministic execution assumption) or not, and whether
the message actually needs to be sent out on the
communication channel at all.
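
The way a read operation can pick its message source
from the recovery stage is sketched below. This is not
the ftread() code: the class and its three queues are
hypothetical, and the sketch uses a stricter rule than
the text requires, draining all in-transit messages
before any new message, which trivially satisfies the
per-sender ordering constraint.

```python
from collections import deque
from enum import Enum

class Stage(Enum):
    REPLAY = "doing deterministic replay"
    IN_TRANSIT = "receiving in-transit messages"
    DONE = "recovery completed"

class RecoveringReader:
    def __init__(self, log, in_transit, channel):
        self.log = deque(log)                # messages to replay from the log
        self.in_transit = deque(in_transit)  # messages re-sent by the throwback agent
        self.channel = deque(channel)        # new messages on the channel

    def stage(self):
        if self.log:
            return Stage.REPLAY
        if self.in_transit:
            return Stage.IN_TRANSIT
        return Stage.DONE

    def read(self):
        """Return (stage, message); message is None if nothing is pending."""
        stage = self.stage()
        if stage is Stage.REPLAY:
            return stage, self.log.popleft()
        if stage is Stage.IN_TRANSIT:
            return stage, self.in_transit.popleft()
        return stage, (self.channel.popleft() if self.channel else None)
```

A caller can loop on read() and switch to normal
processing once the returned stage is Stage.DONE.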

A message in a receiver log file contains five fields:
message sequence number, sender id, reference id,
message size and message data. Sender id is the number
assigned by watchd to each application at start-up time.
Sequence number is used during message reordering to
ensure that FIFO order for messages from the same sender
is maintained. Reference id is given by ftwrite() and is
also used during message reordering. The message
structure in the sender log is the same except that it
contains the receiver id instead of sender id, and it
does not contain the reference id.
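
The five-field receiver log record can be modeled as
follows. The field order follows the text, but the
on-disk encoding is an assumption made for this sketch:
four little-endian 32-bit unsigned integers followed by
the raw message bytes. The actual libft layout is not
given in the paper.

```python
import struct
from dataclasses import dataclass

# Assumed header layout: sequence number, sender id, reference id, size.
HEADER = struct.Struct("<IIII")

@dataclass(frozen=True)
class ReceiverLogRecord:
    seq: int
    sender_id: int
    reference_id: int
    data: bytes

    def pack(self):
        """Serialize the record; the size field is derived from the data."""
        return HEADER.pack(self.seq, self.sender_id,
                           self.reference_id, len(self.data)) + self.data

    @classmethod
    def unpack(cls, buf):
        """Parse one record from the start of buf."""
        seq, sender_id, reference_id, size = HEADER.unpack_from(buf)
        data = bytes(buf[HEADER.size:HEADER.size + size])
        return cls(seq, sender_id, reference_id, data)
```

A sender log record would follow the same pattern with a
receiver id in place of the sender id and without the
reference id.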

4 Experimental Results 

Performance measurements for failure-free overhead
and recovery time were carried out by applying
progressive retry to two telecommunications systems.
The two systems used were the REPL [15] file system,
and a subsystem of a switched service network system
(which we refer to only as System N due to its
proprietary nature).

Since most of the source code was unavailable and 
our objective was to measure the performance, not the
effectiveness, we implemented two simulators which 
use the same software architectures, follow the same 
communication patterns and generate the same work- 
load conditions as the actual systems. The simulators 
were also useful for doing controlled fault injection at 
specific points in the programs. 

4.1 General Experimental Setup 

The experiments for both systems studied the
performance under two categories:

• measurement of failure-free overhead; 

• measurement of recovery time for different steps 
of progressive retry. 

The run time of each simulation was on the order of
one hour or more and each measurement was averaged
over four runs.





4.2 Performance measurement on System N

The part of System N that we modeled has the 
communication pattern shown in Figure 3. Nodes A 
and B report to Node D every 10 seconds. C sends 
a status report to D every 2 seconds. D periodically 
reports to F which also receives reports from E every 
30 seconds and which in turn reports to G.³



Figure 3: Communication pattern for the part of Sys- 
tem N under study 

System N uses a coordinated checkpointing scheme 
where process D is the coordinator. The failure free 
overhead was measured for two different checkpoint 
intervals: a checkpoint every 200 messages received 
by D and every 400 messages. Synchronous logging 
was used for both sender and receiver logging. The 
critical data sizes for each of the processes were of the 
order of a few kilobytes. 

The recovery time measurements were done for the 
checkpoint interval with 400 messages in order to get 
a worst case measure of timing. Three different failure 
instants were assumed: failure at 25% of checkpoint
interval, at 50% of checkpoint interval and at 75% of 
checkpoint interval. The failures were injected at the 
node shown as G in Figure 3.

The observations are given in Tables 1 and 2. Table 1
shows the failure-free overhead for System N while
Table 2 shows actual recovery time in seconds for step 1
alone, steps 1 & 2, steps 1, 2 & 3, steps 1, 2, 3 & 4,
and step 5. Table 2 also shows the recovery times for the
first four cases as a percentage of the time taken for
step 5 (large-scope rollback recovery) and the number of
processes involved at each step.

³ The timings used in the simulations were obtained from the
specification documents of the system.


Table 1: Failure-free overhead for System N
[Table body lost in extraction. Surviving labels: Execution;
No logging/checkpointing.]

From Table 1, the run time overhead of message
logging and checkpointing for System N is about 3%.
In case of a failure, the time taken for doing retry is 
very low compared to the time that large-scale roll- 
back takes, as shown in Table 2. Also note that the 
number of processes involved in doing steps 1 to 4 is 
at most 2, compared to 7, which is the number of pro- 
cesses involved in step 5. In most systems, the smaller 
the number of processes involved in recovery, the less 
impact the failure has. The results shown here cou- 
pled with this observation make the progressive retry 
technique extremely attractive for use with System N. 

System N has been deployed in the field for more 
than 2 years now. Data obtained from the field has 
shown that more than 90% of the exceptions (failures) 
that occurred in the last 2 years have been successfully 
recovered by steps 1 to 3. 

4.3 Performance measurement on REPL 

REPL [15] is a collection of file system library func- 
tions and server processes that runs on a primary and 
a backup machine. Applications run on the primary
machine and write critical files onto the primary file
system. The REPL library intercepts the file system 
calls, produces update messages and sends the update 
messages to the REPL server processes, which then 
transfer the update messages to the REPL processes 
on the backup node. The backup REPL processes 
replay the update messages and reproduce the file up- 
dates on the backup node. REPL has been used in 
several telecommunications systems to replicate criti- 
cal files and databases. The communication graph for 
REPL is shown in Figure 4. 

A normal workload for REPL is a burst mode work- 
load. It receives a burst of messages from an applica- 
tion, then becomes idle for some time and this cycle 
repeats over and over again. Since the burst frequency 
depends on the application that is using REPL, it is 
not possible to define one workload for the system. 
Thus the experiment has to be conducted for different 
burst frequencies. 

A standard burst size of 10 messages per burst was
used for the experiment. The workload was varied between
6 bursts per minute and 1.25 bursts per minute for
measuring the recovery time. The failure-free overhead
was measured for the workload with a frequency of 6
standard bursts per minute, which is the worst case
workload due to the high communication frequency.

Table 2: Recovery times for System N. t: time in seconds,
%: recovery time as a percentage of step 5 time, n: number
of processes involved

Figure 4: Communication pattern for REPL

The failure-free overhead measurements are given
in Table 3. Checkpoint intervals of 500 messages, 400
messages and 250 messages per checkpoint were studied.
The checkpoint size varied from 10 Kb to about 100 Kb.
The failure-free overhead for the worst case workload
has a maximum value of 10.9%, which is acceptable in
most REPL applications.


Table 3: Failure-free overhead for REPL
[Table body lost in extraction. Surviving labels: Type;
No logging/checkpointing; Chk&Log; Time(s). One surviving
entry: 3398.]

Recovery time data was collected for message densities
of 6, 3, 2, 1.5 and 1.25 standard bursts per minute
(see Figures 5 and 6). For each message density the
data was collected for failures injected at 20%, 40%,
60% and 80% of the checkpoint interval. Fault injection
was done at the node marked S in Figure 4. Figures 5 and
6 present the plots corresponding to failures at 20% and
80% of the checkpoint interval.

From the Figures, we observe that: 

• steps 1 and 2 have a low recovery time compared
to that of step 5 for all the message densities;
therefore, steps 1 and 2 are very attractive for
various message densities;

• step 3 or step 4 of progressive retry could save a 
lot of recovery time only if the message density is 
low; 

• the later a failure occurs in a checkpoint interval, 
the lower is the percentage of the recovery time 
compared with that of the step 5 retry. In other 
words, if a failure occurs later in a checkpoint in- 
terval of a system, progressive retry has a greater 
impact in reducing the recovery time provided the 
system successfully recovers at an early step. 

5 Concluding Remarks 

We have described a 5-step progressive retry tech- 
nique using message logging as well as checkpointing 
to limit the scope of rollback and thereby provide a 
means for achieving localized and fast recovery. The 
technique is designed for continuously-running soft- 
ware systems which can absorb a certain degree of 
performance overhead and significantly benefit from 
reduced service unavailability. Our approach, which is 
based on the piecewise deterministic execution model, 
employs message replay to reconstruct state during 
recovery, message comparison to verify whether the 
above assumption is true, and message reordering to 
introduce environment diversity. Progressive retry has 
been implemented as part of a Software Fault Tol- 
erance Platform developed at AT&T Bell Laborato- 
ries to provide automatic, economical, effective and 
efficient software failure recovery. Experiments con- 
ducted using this implementation of progressive retry 
have shown that the technique can significantly reduce 
failure recovery time while incurring only small per- 
formance overhead. Experience has also shown that 
incorporating progressive retry is easy as it requires 
adding only a few lines of code to a program.

Figure 5: REPL: Time taken for retry (as a percentage of
large-scope rollback recovery time) vs. the retry steps
executed: failure at 80% checkpoint interval.

Figure 6: REPL: Time taken for retry (as a percentage of
large-scope rollback recovery time) vs. the retry steps
executed: failure at 20% checkpoint interval.